Red Wine Data Exploration by Ashley Adrias

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   :0.900  
##  1st Qu.: 7.100   1st Qu.:0.3900   1st Qu.:0.0900   1st Qu.:1.900  
##  Median : 7.900   Median :0.5200   Median :0.2500   Median :2.200  
##  Mean   : 8.258   Mean   :0.5287   Mean   :0.2661   Mean   :2.409  
##  3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4200   3rd Qu.:2.600  
##  Max.   :13.200   Max.   :1.5800   Max.   :1.0000   Max.   :8.300  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 21.00      
##  Median :0.07900   Median :13.00       Median : 37.00      
##  Mean   :0.08701   Mean   :15.18       Mean   : 44.38      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 60.00      
##  Max.   :0.61100   Max.   :47.00       Max.   :143.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 52  
##  Median :0.9967   Median :3.310   Median :0.6200   Median :10.20   5:648  
##  Mean   :0.9967   Mean   :3.316   Mean   :0.6573   Mean   :10.43   6:614  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:190  
##  Max.   :1.0029   Max.   :4.010   Max.   :2.0000   Max.   :14.00   8: 18
## 'data.frame':    1532 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...

To reduce the impact of outliers, the top 1% of values for fixed acidity, residual sugar, free sulfur dioxide, and total sulfur dioxide were removed.
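A minimal sketch of this kind of trimming in base R is shown here. It uses only the first ten rows of two columns from the str() output above, so the row counts do not match the full dataset. The helper name `below_top_1pct`, the use of `<=` rather than `<`, and computing every cutoff on the untrimmed data are assumptions, not necessarily what the original analysis did.

```r
# First ten rows of two of the trimmed columns, copied from the str() output.
red_wine <- data.frame(
  fixed.acidity  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
)

# Hypothetical helper: TRUE for values at or below the 99th percentile.
below_top_1pct <- function(x) x <= quantile(x, 0.99)

# Keep a row only if it passes the cutoff for every trimmed column.
keep <- below_top_1pct(red_wine$fixed.acidity) &
  below_top_1pct(red_wine$residual.sugar)
trimmed <- red_wine[keep, ]

nrow(trimmed)
```

In this small sample the rows holding the largest fixed acidity (11.2) and the largest residual sugar (6.1) are dropped.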

Univariate Plots Section

For this univariate section, we will create histograms to observe the distribution of each variable.

The Total Sulfur Dioxide, Free Sulfur Dioxide, and Sulphates variables each show a long right tail, so we apply a log10 transform to bring their distributions closer to normal. The variance decreases in both cases summarized below (total sulfur dioxide and sulphates), especially for total sulfur dioxide.

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.532000e+03 0.000000e+00 0.000000e+00 6.000000e+00 1.430000e+02 
##        range          sum       median         mean      SE.mean 
## 1.370000e+02 6.799100e+04 3.700000e+01 4.438055e+01 7.572701e-01 
## CI.mean.0.95          var      std.dev     coef.var 
## 1.485396e+00 8.785376e+02 2.964014e+01 6.678632e-01
##      nbr.val     nbr.null       nbr.na          min          max 
## 1.532000e+03 0.000000e+00 0.000000e+00 7.781513e-01 2.155336e+00 
##        range          sum       median         mean      SE.mean 
## 1.377185e+00 2.374926e+03 1.568202e+00 1.550212e+00 7.621960e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 1.495059e-02 8.900043e-02 2.983294e-01 1.924442e-01

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.532000e+03 0.000000e+00 0.000000e+00 3.300000e-01 2.000000e+00 
##        range          sum       median         mean      SE.mean 
## 1.670000e+00 1.006990e+03 6.200000e-01 6.573042e-01 4.349794e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 8.532185e-03 2.898652e-02 1.702543e-01 2.590190e-01
##       nbr.val      nbr.null        nbr.na           min           max 
##  1.532000e+03  1.000000e+00  0.000000e+00 -4.814861e-01  3.010300e-01 
##         range           sum        median          mean       SE.mean 
##  7.825161e-01 -2.971731e+02 -2.076083e-01 -1.939772e-01  2.481604e-03 
##  CI.mean.0.95           var       std.dev      coef.var 
##  4.867703e-03  9.434607e-03  9.713191e-02 -5.007388e-01
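The before-and-after comparison can be sketched in base R as follows, using only the first ten total.sulfur.dioxide values from the str() output, so the numbers differ from the full-data tables above. Because variance depends on the scale of the variable, the sketch also compares the coefficient of variation (standard deviation divided by mean), which is the quantity shown as coef.var in the tables. The helper name `coef_var` is an assumption made for illustration.

```r
# First ten total.sulfur.dioxide values from the str() output.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

log_so2 <- log10(total_so2)

# Hypothetical helper: coefficient of variation (sd relative to the mean).
coef_var <- function(x) sd(x) / mean(x)

c(var_raw = var(total_so2), var_log = var(log_so2))
c(cv_raw = coef_var(total_so2), cv_log = coef_var(log_so2))
```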

The acidity variables (fixed and volatile) also show a long right tail, so we apply a log10 transform to bring them closer to a normal distribution. The variance of fixed acidity decreased substantially (from 2.67 to 0.0069), while the variance of volatile acidity decreased only slightly (from 0.0316 to 0.0232).

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.532000e+03 0.000000e+00 0.000000e+00 4.600000e+00 1.320000e+01 
##        range          sum       median         mean      SE.mean 
## 8.600000e+00 1.265170e+04 7.900000e+00 8.258290e+00 4.172367e-02 
## CI.mean.0.95          var      std.dev     coef.var 
## 8.184160e-02 2.667005e+00 1.633097e+00 1.977524e-01
##      nbr.val     nbr.null       nbr.na          min          max 
## 1.532000e+03 0.000000e+00 0.000000e+00 6.627578e-01 1.120574e+00 
##        range          sum       median         mean      SE.mean 
## 4.578161e-01 1.392269e+03 8.976271e-01 9.087915e-01 2.126268e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 4.170705e-03 6.926194e-03 8.322376e-02 9.157629e-02

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.532000e+03 0.000000e+00 0.000000e+00 1.200000e-01 1.580000e+00 
##        range          sum       median         mean      SE.mean 
## 1.460000e+00 8.099800e+02 5.200000e-01 5.287076e-01 4.540049e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 8.905373e-03 3.157766e-02 1.777010e-01 3.361046e-01
##       nbr.val      nbr.null        nbr.na           min           max 
##  1.532000e+03  3.000000e+00  0.000000e+00 -9.208188e-01  1.986571e-01 
##         range           sum        median          mean       SE.mean 
##  1.119476e+00 -4.629048e+02 -2.839967e-01 -3.021572e-01  3.887968e-03 
##  CI.mean.0.95           var       std.dev      coef.var 
##  7.626306e-03  2.315816e-02  1.521781e-01 -5.036387e-01

Wine quality is separated into 3 bins: bad, average, and excellent. This helps classify wine quality. The histogram below shows that most wines fall into the average category.

##       bad   average excellent 
##        62      1262       208
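One way to reproduce these bin counts is sketched below with base R's cut(). The quality scores are rebuilt from the per-score counts in the summary() output above (10, 52, 648, 614, 190, 18 for scores 3 to 8). The break points are inferred from the counts in the table, not taken from the original code.

```r
# Rebuild the quality scores from the counts shown in the summary() output.
quality <- rep(3:8, times = c(10, 52, 648, 614, 190, 18))

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("bad", "average", "excellent"),
              ordered_result = TRUE)

table(rating)
```

With these breaks, scores 3 and 4 map to bad, 5 and 6 to average, and 7 and 8 to excellent, which gives the 62 / 1262 / 208 split shown above.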

Univariate Analysis

What is the structure of your dataset?

After removing the top 1% of values for the sulfur dioxide, residual sugar, and fixed acidity variables, 1532 observations of 12 variables were left (down from 1599).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is wine quality: I want to learn what makes a wine excellent or bad. I would like to eventually use machine learning to predict what wines people will like given past preferences.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think that people have subjective opinions about the quality of wine due to their individual palates and preferences. I would expect sugar, acidity, and alcohol to cluster with wine quality, so people who prefer sweet, sour, or bitter wines could possibly be grouped together and would grade wine quality similarly.

Did you create any new variables from existing variables in the dataset?

I created a categorical variable from the wine quality column where 4 or lower was assigned bad, 5-6 was assigned average, and 7 or higher was assigned excellent.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I cleaned up the data by removing the outliers (the top 1%) from the sulfur dioxide, sugar, and acidity variables. Then I applied a log10 transform to the sulphates, acidity, and sulfur dioxide data to bring them closer to normal distributions.

Bivariate Plots Section

To get a bird’s eye view of the dataset, we will make a correlation plot of all the wine variables. This visual shows us how our variables interact from a bivariate standpoint.

The correlation coefficient for pH and fixed acidity is -0.68, meaning that pH tends to drop as fixed acidity increases. This makes sense, because a lower pH means that the substance is more acidic.

## [1] -0.6796961

The correlation between citric acid and pH is weaker (-0.53) than that of fixed acidity and pH. This makes sense because citric acid is only one component of fixed acidity.

## [1] -0.5277259

Acetic acid is a volatile acid, and volatile acidity has a positive correlation with pH of 0.24. Volatile acids can evaporate while the wine bottle remains open. This is what wine connoisseurs call airing, which allows the wine to breathe. While the wine is airing, the pH level should increase, because the acidity is decreasing. However, the time that each wine was exposed to air is not recorded in this dataset. It would be interesting to see how pH varies with airing time.

## [1] 0.2385135
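As a rough illustration of how such coefficients are obtained, the sketch below calls base R's cor() on the first ten rows of four columns from the str() output above. With so few rows the values will not match the full-data coefficients reported here. The object names `sample_rows` and `cor_matrix` are assumptions made for the example.

```r
# First ten rows of the acidity columns and pH, from the str() output.
sample_rows <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations (the default method) for all columns.
cor_matrix <- cor(sample_rows)
round(cor_matrix, 2)

# A single pair can be computed directly.
cor(sample_rows$fixed.acidity, sample_rows$pH)
```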

I am most interested in seeing which variables affect wine ratings. By binning the ratings into bad, average, and excellent, we can classify wines by type and explore any correlations. After binning the wine ratings, we will explore how alcohol, pH, volatile acidity, citric acid, and sulphates relate to the binned rating.

From the pH vs. wine rating boxplot shown below, we can see that on average, excellent rated wines have a lower pH (median 3.28) than bad rated wines (median 3.38). However, the difference between excellent and bad is small, and it is difficult to say whether it is significant. Another thing to note is that the average and excellent ratings have similar distributions: the excellent distribution lies within the average distribution. To improve the comparison, we could create smaller wine rating bins and/or survey more wines.

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.303   3.380   3.385   3.500   3.900 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.210   3.315   3.316   3.408   4.010 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.280   3.295   3.380   3.780
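Per-rating summaries in this format can be produced with by(), sketched below. The names `red_wine` and `rating` appear in the output above; the rest is an assumption. The sample holds only the first ten rows from the str() output, none of which is rated bad, so the empty level is dropped and the numbers will not match the full-data tables.

```r
# First ten pH and quality values from the str() output.
red_wine <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Bin quality into the three rating categories (break points assumed).
red_wine$rating <- cut(red_wine$quality, breaks = c(2, 4, 6, 8),
                       labels = c("bad", "average", "excellent"))

# No wine in this ten-row sample is rated bad, so drop the empty level.
red_wine$rating <- droplevels(red_wine$rating)

# Six-number summary of pH within each rating.
by(red_wine$pH, red_wine$rating, summary)

# The same grouping with a single statistic per rating.
ph_medians <- tapply(red_wine$pH, red_wine$rating, median)
ph_medians
```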

From the alcohol percentage vs. rating boxplot, we can observe a much greater difference between excellent and bad rated wines. Excellent rated wines have higher alcohol percentages, while bad and average rated wines are similar. The mean alcohol content of excellent wines is 11.54% (median 11.60%), compared to 10.20% for bad wines. It is also worth noting that the interquartile range of excellent wines (10.80 to 12.22) sits almost entirely above that of bad wines (9.60 to 10.97).

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.20   10.97   13.10 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.50   10.00   10.26   10.90   14.00 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.50   10.80   11.60   11.54   12.22   14.00

From the volatile acidity vs. rating boxplot, we can observe a much greater difference between excellent and bad rated wines. Excellent rated wines have lower volatile acidity, and bad and average rated wines also differ from each other. The trend shows that the lower the volatile acidity, the better the rating. This suggests that wines should be thoroughly aired to allow the acetic acid to evaporate, which in turn would raise the pH and possibly the rating. It would be interesting to see whether there are diminishing returns on quality if the wine is left to air out until no volatile acid is left.

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5800  0.6800  0.7306  0.8838  1.5800 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5385  0.6400  1.3300 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3100  0.3700  0.4090  0.4925  0.9150

From the Citric Acid vs. Rating boxplot, we can observe a much greater difference between excellent and bad rated wines. Excellent rated wines have higher citric acid levels, and bad and average rated wines also differ in mean citric acid. The trend shows that the higher the citric acid, the better the rating. It would be interesting to see whether there are diminishing returns on quality beyond a certain citric acid level.

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0750  0.1713  0.2675  1.0000 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2539  0.4000  0.7600 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.3950  0.3687  0.4900  0.7600

From the Sulphates vs. Rating boxplot, we can observe a much greater difference between excellent and bad rated wines. Excellent rated wines have higher sulphates levels, and bad and average rated wines also differ in mean sulphates. The trend shows that the higher the sulphates, the better the rating. It would be interesting to see whether there are diminishing returns on quality beyond a certain sulphates level.

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4925  0.5600  0.5927  0.6000  2.0000 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6461  0.7000  1.9800 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7444  0.8200  1.3600

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I was interested in which variables affect wine quality. It seems that alcohol, volatile acidity, citric acid, and sulphates had the greatest effect on wine rating. In general, here are my findings:

Alcohol: The mean alcohol content of excellent wines is 11.54% (median 11.60%), compared to 10.20% for bad wines. The interquartile range of excellent wines (10.80 to 12.22) also sits almost entirely above that of bad wines (9.60 to 10.97).

Volatile Acidity: The trend shows that the lower the volatile acidity, the better the rating. This suggests that wines should be thoroughly aired to allow the acetic acid to evaporate, which in turn would raise the pH and possibly the rating. The mean volatile acidity for excellent rated wines was 0.409.

Citric Acid: The trend shows that the higher the citric acid, the better the rating. The mean citric acid for excellent rated wines was 0.369 (third quartile 0.490).

Sulphates: There is a much greater difference between excellent and bad rated wines. Excellent rated wines have higher sulphates levels, and bad and average rated wines also differ in mean sulphates. The trend shows that the higher the sulphates, the better the rating. The mean sulphates content for excellent rated wines was 0.744 (third quartile 0.820).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fixed acidity and pH were negatively correlated, because a lower pH means a more acidic wine.

What was the strongest relationship you found?

The strongest relationship with wine quality was alcohol. Alcohol and wine quality had a correlation of 0.48783279.
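Because quality is stored as an ordered factor in the trimmed data (see the second str() output), it has to be converted to numbers before a Pearson correlation can be computed. The sketch below shows one way to do that on the first ten rows from the str() output. How the original analysis handled the conversion is not shown, so this is an assumption, and the ten-row result will differ from the full-data value above.

```r
# First ten alcohol and quality values from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
                  levels = 3:8, ordered = TRUE)

# as.numeric() on a factor returns level codes (1..6), not the scores (3..8),
# so convert through character first to recover the original scores.
quality_num <- as.numeric(as.character(quality))

r <- cor(alcohol, quality_num)
r
```

For an ordinal score like this, `cor(alcohol, quality_num, method = "spearman")` is a rank-based alternative.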

Multivariate Plots Section

After the bivariate analysis, it seems that the variables that affect wine rating are sulphates, citric acid, volatile acidity, alcohol, and pH. In this multivariate section, we will compare how these factors interact with respect to wine rating.

Right away, we can notice that there are many more average rated wines, and that they overlap with both the bad and excellent rated wines. This is similar to what we saw in the boxplots. However, when comparing bad and excellent rated wines, we notice that nearly all bad rated wines had less than ~12% alcohol and a log10(sulphates) value of around -0.25 or less, whereas many excellent rated wines had more than 12% alcohol and a log10(sulphates) value between -0.25 and 0. This comparison suggests that a higher alcohol content (>12%) and a log10(sulphates) value between -0.25 and 0 are associated with average and excellent wine ratings. However, these 2 variables don’t adequately differentiate between average and excellent ratings.

Again, we can notice that there are many more average rated wines, and that they overlap with both the bad and excellent rated wines. However, when comparing bad and excellent rated wines, we notice that nearly all bad rated wines had less than ~12% alcohol and between 0 and 0.5 citric acid, whereas many excellent rated wines had more than 12% alcohol and between 0 and 0.75 citric acid. However, these 2 variables don’t adequately differentiate between average and excellent ratings. I would also note that we don’t have as many data points for bad rated wines. Perhaps, if we had more, then the data would look similar across all wine ratings.

It could be that alcohol isn’t a big enough factor to show a visible difference between ratings, so we will try sulphates and citric acid against rating in the facet plot below.

We can notice that there are many more average rated wines, and that they overlap with both the bad and excellent rated wines. However, when comparing bad and excellent rated wines, we notice that bad rated wines cluster around a log10(sulphates) value of -0.25 or less and below 0.5 citric acid, whereas many excellent rated wines have a log10(sulphates) value between -0.25 and 0 and less than 0.75 citric acid. However, these 2 variables don’t adequately differentiate between average and excellent ratings.

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4815 -0.3076 -0.2518 -0.2455 -0.2218  0.3010 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4318 -0.2676 -0.2147 -0.2011 -0.1549  0.2967 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.40894 -0.18709 -0.13077 -0.13514 -0.08619  0.13354

Next, we try volatile acidity and citric acid. We notice here that excellent rated wines tend to have a citric acid content higher than ~0.3 and a volatile acidity of less than ~0.5.

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5800  0.6800  0.7306  0.8838  1.5800 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5385  0.6400  1.3300 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3100  0.3700  0.4090  0.4925  0.9150

Here we see again the effect of volatile acidity. Lower volatile acidity and higher sulphates trend toward an excellent rating.

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5800  0.6800  0.7306  0.8838  1.5800 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5385  0.6400  1.3300 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3100  0.3700  0.4090  0.4925  0.9150

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the multivariate section we can now see how variables combine to produce wine ratings. This would be useful for building models. Below is a summary of the rectangles I drew around the excellent and bad wine ratings (sulphates values are on the log10 scale):

Alcohol vs. Sulphates
Excellent: Sulphates -0.25 to 0 and alcohol 10 - 13
Bad: Sulphates -0.375 to -0.125 and alcohol 9 - 12

Alcohol vs. Citric Acid
Excellent: Citric acid 0 to 0.75 and alcohol 9 - 14
Bad: Citric acid 0 to 0.5 and alcohol 9 - 12

Citric Acid vs. Sulphates
Excellent: Sulphates -0.25 to 0 and citric acid 0 - 0.75
Bad: Sulphates -0.375 to -0.125 and citric acid 0 - 0.5

Sulphates vs. Volatile Acidity
Excellent: Sulphates -0.25 to 0 and volatile acidity 0.6 to 0.8
Bad: Sulphates -0.375 to -0.125 and volatile acidity 0.4 - 1.2

Were there any interesting or surprising interactions between features?

When I compare the max and min values of alcohol, sulphate, and citric acid, it seems to be that sulphates actually narrow the acceptable band of alcohol content to give an excellent rating. For example:

To get an excellent rating within the given band of sulphates, alcohol had to be between 10 and 13. However, with citric acid, the excellent alcohol band was larger, from 9 to 14.

Alcohol vs. Sulphates
Excellent: Sulphates -0.25 to 0 and alcohol 10 - 13
Bad: Sulphates -0.375 to -0.125 and alcohol 9 - 12

Alcohol vs. Citric Acid
Excellent: Citric acid 0 to 0.75 and alcohol 9 - 14
Bad: Citric acid 0 to 0.5 and alcohol 9 - 12


Final Plots and Summary

Plot One: Sulphates and Quality

## red_wine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4925  0.5600  0.5927  0.6000  2.0000 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6461  0.7000  1.9800 
## -------------------------------------------------------- 
## red_wine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7444  0.8200  1.3600

Description One

Excellent wines tend to have higher sulphates content. The trend shows that the higher the sulphates, the better the rating.

Plot Two: Alcohol & Sulphates vs. Quality

Description Two

The maximum and minimum are shown below for excellent and bad rated wines (sulphates values are on the log10 scale):

Alcohol vs. Sulphates
Excellent: Sulphates -0.25 to 0 and alcohol 10 - 13
Bad: Sulphates -0.375 to -0.125 and alcohol 9 - 12

In general, the higher the sulphates and alcohol content, the better the rating.

Plot Three: Volatile Acidity vs Quality

Description Three

The density graph shows where excellent quality wines fall with respect to volatile acidity. In this dataset, no excellent rated wine has more than 1 g/dm^3 volatile acidity (the maximum is 0.915).


Reflection

The Struggle

I personally struggled with the time it took to read up on wine background, create all the plots, and then discuss them. The entire project was quite large and took a lot of time. But I feel more confident in my knowledge about wine.

The Good

I liked this project because it taught me about wine and now I understand why it is important to air out the wine in order to remove the volatile acidity.

Future work

I would collect more data. There weren’t enough data points for bad rated wines. It may have been because the people tasting the wines weren’t that experienced, so they just gave an average score to wines that would otherwise have been rated bad. It would also be nice to know how long the wines were aired out, where the grapes were grown, and where they were processed. After taking the machine learning course, I will build a predictive model.